Team 20: HCDR

Home Credit Default Risk (HCDR)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Some of the challenges

  1. Dataset size
    • 2.71 GB of data uncompressed, with millions of rows
    • 688 MB compressed

Kaggle API setup

Kaggle is a Data Science Competition Platform which shares a lot of datasets. In the past, submitting your results was troublesome, as you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to set up; it took me less than 15 minutes to finish a submission.

  1. Install library

For more detailed information on setting up the Kaggle API, see here and here.

Commonly Used Functions for the Project

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would either be unable to obtain loans or become victims of untrustworthy lenders.

The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).


Data files overview

There are 7 different sources of data:

Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the following Data Webpage and unzip the zip file to the DATA_DIR
  2. If you plan to use the Kaggle API, please use the following steps.
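The Kaggle API route above can be sketched as follows. This is a minimal setup sketch, assuming you have generated an API token from your Kaggle account page and saved it as `~/.kaggle/kaggle.json`:

```shell
# Install the Kaggle CLI and lock down the credentials file
pip install kaggle
chmod 600 ~/.kaggle/kaggle.json

# Download and unzip the competition data into the project data directory
DATA_DIR="../../../Data/home-credit-default-risk"
kaggle competitions download -c home-credit-default-risk -p "$DATA_DIR"
unzip -o "$DATA_DIR/home-credit-default-risk.zip" -d "$DATA_DIR"
```

The `kaggle competitions files home-credit-default-risk` command shown earlier can be used first to verify access before downloading.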

The Other datasets

Exploratory Data Analysis and Visual EDA

Application Train File

Description of Numerical Features in Application Train

Describing All Features of Application Train

Missing Values in Application Train

Target Value Distribution

Observations from Missing Features graph

We observed that 67 features are missing at least one value, and 41 features are missing at least 50% of their values.

The target distribution is highly skewed: approximately 91% of the data has target value 0.

Feature Analysis of Application Train Features

Univariate Analysis on NAME_CONTRACT_TYPE

1. Analysis on NAME_CONTRACT_TYPE

NAME_CONTRACT_TYPE : Contract product type of the loan application

Observations: The number of cash loans taken by individuals is higher than the number of revolving loans.

Univariate Analysis on Gender

2. Univariate Analysis on Gender

CODE_GENDER : The Gender of the Applicant


Observations

1) The number of loans taken by females is much higher than the number taken by males.

2) Women account for approximately 200K applications, whereas men account for approximately 100K.

3) Men repay their loans at a somewhat higher rate than women.

4) XNA indicates that the applicant's gender is not specified. These values can be imputed later in the pipelines.

Univariate Analysis on Possession of Vehicle

3. Univariate Analysis on Possession of Vehicle

FLAG_OWN_CAR : Does the applicant own a car or not ?

Observations

1) Most applicants don't have a car.

2) Whether or not an applicant owns a car makes very little difference in the repayment of loans.

Univariate Analysis on Number of Children

5. Univariate Analysis on Number of Children

CNT_CHILDREN : The number of children the applicant has

Observations

  1. Applicants with no children take out the largest number of loans
  2. Applicants with no children also repay at a higher rate

Univariate Analysis on Dependents

Observations

  1. Applicants who came unaccompanied took out the most loans
  2. Unaccompanied applicants also repay at a higher rate

Univariate Analysis on Income Type

Observations

  1. Working people took out the most loans and repaid at a higher rate than other income types

Univariate Analysis on Family Status

Observations

  1. Married people took out the most loans
  2. Married people also repaid at a higher rate

Observations from Above Plots

Analysis on Numerical Features : AMT_CREDIT

Observations: The credit amount of most loans is less than 10 lakhs (1 million).

Analysis on Numerical Features : AMT_ANNUITY

Observations: Most annuity payments are below 50K.

Analysis on Numerical Features : AMT_GOODS_PRICE

Observations: Most loans are approved when the price of the goods to be purchased is below 10 lakhs (1 million).

Client Age Distribution

Observations: Most loan applicants are in the age range 25 to 65.

Client Occupation Distribution

Observations: The distribution of applicant occupations is displayed; laborers submit the largest number of applications.

Analysis on External Sources : EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3

Observations: 1) Probability distribution plots of EXT_SOURCE_1, EXT_SOURCE_2 and EXT_SOURCE_3, the normalized scores from external data sources.

Analysis of AMT_CREDIT vs AMT_ANNUITY

Scatterplot Analysis of AMT_CREDIT vs AMT_ANNUITY

Bureau and Bureau Balance Files EDA

Missing Information and Plots

Observations

Seven features in bureau have missing values


Categorical Features : Credit Active and Credit Type

CREDIT_ACTIVE : Status of the Credit Bureau reported credits
CREDIT_TYPE : Type of Credit Bureau credit

Previous Application File EDA

Missing Values in Previous Applications

EDA for Previous Application

EDA of CREDIT CARD BALANCE

Missing Values in Credit Card Balance

EDA of POS_CASH_BALANCE

Missing Values in POS_CASH_BALANCE

EDA of Installment Payments

Missing Values in Installment Payments

Removing Null Values from Application Train

Feature Aggregation Class for Aggregation using Pipeline

Merging Previous Application Dataset with Application Train|Test

Feature Engineering of Previous Application

1) pa_APPLICATION_CREDIT_DIFF = Difference between the amount the client asked for on the previous application and the final credit granted
2) pa_APPLICATION_CREDIT_RATIO = Ratio of the amount the client asked for on the previous application to the final credit granted
3) pa_CREDIT_TO_ANNUITY_RATIO = Ratio of the credit amount to the annuity amount of the previous application
4) pa_DOWN_PAYMENT_TO_CREDIT = Down payment as a percentage of the total credit of the previous application
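The four features above can be sketched in pandas with toy rows. The column names (AMT_APPLICATION, AMT_CREDIT, AMT_ANNUITY, AMT_DOWN_PAYMENT) are from the competition's previous_application file; the actual pipeline computes these per row before aggregating per client:

```python
import pandas as pd

# Toy previous_application rows (two prior applications)
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 2],
    "AMT_APPLICATION": [100_000.0, 50_000.0],   # amount the client asked for
    "AMT_CREDIT": [90_000.0, 60_000.0],         # final credit granted
    "AMT_ANNUITY": [9_000.0, 6_000.0],
    "AMT_DOWN_PAYMENT": [9_000.0, 0.0],
})

prev["pa_APPLICATION_CREDIT_DIFF"] = prev["AMT_APPLICATION"] - prev["AMT_CREDIT"]
prev["pa_APPLICATION_CREDIT_RATIO"] = prev["AMT_APPLICATION"] / prev["AMT_CREDIT"]
prev["pa_CREDIT_TO_ANNUITY_RATIO"] = prev["AMT_CREDIT"] / prev["AMT_ANNUITY"]
prev["pa_DOWN_PAYMENT_TO_CREDIT"] = prev["AMT_DOWN_PAYMENT"] / prev["AMT_CREDIT"]
```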

Merging Credit Card Balance Dataset with Application Train|Test

Feature Engineering of Credit card balance

1) cc_LIMIT_USE = Total used credit as a percentage of the total limit
2) cc_PAYMENT_DIV_MIN = Ratio of the month's credit payment to the month's minimum installment
3) cc_LATE_PAYMENT = Late payment flag for the month
4) cc_DRAWING_LIMIT_RATIO = Ratio of the month's total drawings to the credit limit amount

Merging POS_CASH_Balance Dataset with Application Train|Test

INSTALLMENT_PAYMENTS Pipeline

Feature Engineering of Installment Payment

1) ins_PAID_OVER_AMOUNT = Excess amount paid over the prescribed installment amount
2) ins_DBD = Difference between the actual installment payment date and the scheduled payment date
3) ins_LATE_PAYMENT = Late installment payment flag
4) ins_INSTALMENT_PAYMENT_RATIO = Ratio of the actual installment payment to the prescribed installment amount
5) ins_LATE_PAYMENT_RATIO = Installment payment ratio for late payments
6) ins_SIGNIFICANT_LATE_PAYMENT = Flag for late-payment ratios where payments are late more than 5% of the time

Bureau and Bureau Balance

FEATURE ENGINEERING of Bureau

Features Added from Bureau

1) bur_CREDIT_DURATION = Difference between the days before the current application when the client applied for Credit Bureau credit and the remaining duration of the CB credit in days
2) bur_ENDDATE_DIFF = Difference between the actual close date of the Credit Bureau credit and the remaining duration of the credit in days
3) bur_UPDATE_DIFF = Difference between the credit end date and the last update date
4) bur_DEBT_PERCENTAGE = Total debt as a percentage of credit
5) DEBT_CREDIT_DIFF = Difference between total credit and total debt
6) CREDIT_TO_ANNUITY_RATIO = Ratio of total credit to annuity
7) DEBT_TO_ANNUITY_RATIO = Ratio of total debt to annuity
8) CREDIT_OVERDUE_DIFF = Difference between total credit and the overdue amount
9) appcount = Total count of previous loans in the bureau history

Joining all the Secondary Tables with Primary Tables : Application Train/ Test

Join with Application test

Correlations of Engineered features with Target variable

Correlations of top 50 Numeric features with Target variable

Pipeline

OHE with previously unseen unique values in the test/validation set

Train, validation and Test sets (and the leakage problem we have mentioned previously):

Let's look at a small use case that shows how to deal with this:

This last problem can be solved by using the option handle_unknown='ignore' of the OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.

Here is an example of that in action:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE',
               'NAME_EDUCATION_TYPE', 'OCCUPATION_TYPE', 'NAME_INCOME_TYPE']

# Notice handle_unknown="ignore" in the OHE, which ignores values from the
# validation/test set that do NOT occur in the training set.
# DataFrameSelector is our small custom transformer that selects the named
# columns from a DataFrame.
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
    ])
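As a quick standalone illustration of the handle_unknown="ignore" behavior (toy category values, plain OneHotEncoder without the selector/imputer steps):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the two contract types present in the toy training data
train_cats = np.array([["Cash loans"], ["Revolving loans"]])
enc = OneHotEncoder(handle_unknown="ignore").fit(train_cats)

# A value never seen during fit encodes to an all-zero row
# instead of raising an error at transform time.
seen = enc.transform(np.array([["Cash loans"]])).toarray()
unseen = enc.transform(np.array([["Some new product"]])).toarray()
```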

Baseline Model

To get a baseline, we use some of the features after preprocessing them through the pipeline. The baseline model is a logistic regression model.

Metrics Used for comparison of models:


$PRECISION = \frac{TP}{TP+FP}$

$RECALL = \frac{TP}{TP+FN}$

$F1 = \frac{2 \cdot PRECISION \cdot RECALL}{PRECISION + RECALL}$
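These definitions can be checked with a few lines of Python (the confusion-matrix counts below are illustrative, not project results):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# With 80 true positives, 20 false positives and 20 false negatives,
# precision, recall and F1 all come out to 0.8
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
```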

Hyperparameter tuning for models

Logistic regression model

The logistic regression pipeline is passed through GridSearchCV to find the best parameters. The parameter grid for the initial search in GridSearchCV is: 'linear__C': [10, 100], 'linear__penalty': ['l1', 'l2'], 'linear__solver': ['liblinear']
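The grid keys follow scikit-learn's step__parameter naming, which routes each value to the named pipeline step. A minimal runnable sketch on toy data (not the project's actual pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, random_state=0)

pipe = Pipeline([("imputer", SimpleImputer()),
                 ("linear", LogisticRegression())])

# "linear__C" means: set the C parameter of the step named "linear"
param_grid = {"linear__C": [10, 100],
              "linear__penalty": ["l1", "l2"],
              "linear__solver": ["liblinear"]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
```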

Best parameters:

The best parameters decided by GridSearchCV with 5-fold cross-validation are {'linear__C': 100, 'linear__penalty': 'l2', 'linear__solver': 'liblinear'}

The metrics for the best parameters are

Train Accuracy: 0.9192 | Test Accuracy: 0.9193
Train AUC: 0.7421 | Test AUC: 0.7431
Train F1 Score: 0.8828 | Test F1 Score: 0.8828

XGboost model

Best params:

{'xgb__learning_rate': 0.1, 'xgb__max_depth': 5, 'xgb__min_child_weight': 3}

Random Forest model

The random forest classifier is passed through GridSearchCV to find the best parameters. The parameter grid for the initial search in GridSearchCV is: 'rmf__bootstrap': [True], 'rmf__max_depth': [10, 20], 'rmf__max_features': [2, 3], 'rmf__n_estimators': [100, 200]

Best parameters: The best parameters decided by GridSearchCV with 3-fold cross-validation are: {'rmf__bootstrap': True, 'rmf__max_depth': 20, 'rmf__max_features': 3, 'rmf__n_estimators': 100}

Saving the models

Resampling of the data for balancing the target variable

Linear model with best params for balanced data

XGboost model with best params for balanced data

Random Forest model with best params for balanced data

Loss Functions

Cross Entropy Loss: $-\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log(p(y_i)) + (1-y_i)\log(1-p(y_i))\right]$
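Written out in plain Python, the binary cross-entropy above looks as follows (the inputs are illustrative):

```python
import math

def cross_entropy(y_true, p_pred):
    """Binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / n

# Confident predictions on the correct side give a low loss;
# an uninformative p = 0.5 gives log(2) ~ 0.693.
```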

Kaggle Submissions:

[Figure: Kaggle submission screenshot (sub_1.PNG)]

Neural Networks


Kaggle Submission

[Figure: Kaggle submission screenshot (sub2.PNG)]

Write-up

Abstract:

Home Credit Default Risk is a project where we determine the creditworthiness of people who have applied for loans. In previous phases, we completed basic EDA and feature engineering, ran the baseline logistic regression model, and tuned hyperparameters for the XGBoost model. In this phase, we significantly improved the project: we updated the EDA, implemented robust feature engineering for all dataset files, and ran hyperparameter-tuning experiments for the Logistic Regression, XGBoost and Random Forest models. We conducted experiments using both the original imbalanced data and resampled data. After comparison, we found that the XGBoost model with parameters { learning_rate: 0.1, max_depth: 5, min_child_weight: 3 } was the best model, using Test AUC score (0.7427) as the model performance criterion. For the deep learning PyTorch model, we used a feed-forward MLP with two hidden layers of 128 and 64 neurons. We used sigmoid activation functions and the SGD optimizer with cross entropy as the loss function. The model achieved a test accuracy of 40%. The best Kaggle submission that we obtained was from the XGBoost model, with a private score of 0.64788 and a public score of 0.65231.

Team Members:

Team Information:

Bhushan Patil (bpatil@iu.edu)
Vaibhav Vishwanath (vavish@iu.edu)
Gavin Henry Lewis (gavlewis@iu.edu)
Prathamesh Deshmukh (pdeshmukh@iu.edu)


Project Description

Data Description

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities. While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic. The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would either be unable to obtain loans or become victims of untrustworthy lenders. The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018). While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Data files overview

There are 7 different sources of data:

  1. application_train/application_test: the main training and testing data with information about each loan application at Home Credit. Every loan has its own row and is identified by the feature SK_ID_CURR. The training application data comes with the TARGET indicating 0: the loan was repaid or 1: the loan was not repaid. The target variable indicates whether the client had payment difficulties, meaning he/she had a late payment of more than X days on at least one of the first Y installments of the loan. Such cases are marked as 1, and all other cases as 0.
  2. bureau: data concerning client's previous credits from other financial institutions. Each previous credit has its own row in bureau, but one loan in the application data can have multiple previous credits.
  3. bureau_balance: monthly data about the previous credits in bureau. Each row is one month of a previous credit, and a single previous credit can have multiple rows, one for each month of the credit length.
  4. previous_application: previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
  5. POS_CASH_BALANCE: monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
  6. credit_card_balance: monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
  7. installments_payment: payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.
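These one-to-many relationships are why the secondary tables must be aggregated per SK_ID_CURR before joining to the application table. A toy sketch of the pattern with hand-made frames and a single bureau column:

```python
import pandas as pd

# Two applications; client 1 has two bureau credits, client 2 has one
app = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [10_000.0, 5_000.0, 7_000.0],
})

# Collapse bureau to one row per client, then left-join onto the applications
agg = (bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"]
             .agg(["mean", "sum", "count"])
             .add_prefix("bur_AMT_CREDIT_SUM_")
             .reset_index())
merged = app.merge(agg, on="SK_ID_CURR", how="left")
```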

Part B: Tasks to be tackled during this phase

  1. Robust Feature Engineering on all data files
  2. Exploring different models with hyperparameter tuning and selection of best model based on performance metrics such as AUC and F1.
  3. Optimization of input data for efficient memory usage and avoiding kernel crashes.
  4. Implementation of MLP model and tensorboard for visualization of model training.
  5. Kaggle submission for both the best model and the MLP model.

Workflow Diagram for the Entire Project

[Figure: project workflow diagram (Phase3.drawio (1).png)]

Neural networks

Fig: Sample visualization of Neural Network Architecture Designed

Architecture Of Neural Network:

The neural network has 4 layers:
Layer 1: 188 neurons (input layer)
Layer 2: 128 neurons, activation function: sigmoid
Layer 3: 64 neurons, activation function: sigmoid
Layer 4: 1 neuron (output layer), activation function: sigmoid

Hyper Parameters:

Learning rate: 0.01
Batch size: 400
Optimizer: SGD
Loss function: cross entropy (BCEWithLogitsLoss)
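A hedged PyTorch sketch of this architecture (layer sizes from the description above; since BCEWithLogitsLoss applies the sigmoid internally, the output layer here emits raw logits):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """188-128-64-1 feed-forward network with sigmoid hidden activations."""
    def __init__(self, n_features=188):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.Sigmoid(),
            nn.Linear(128, 64), nn.Sigmoid(),
            nn.Linear(64, 1),  # logits; the sigmoid is folded into the loss
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```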

We trained the model for 500 epochs; the training loss started at 0.74 and converged to 0.69. The accuracy fluctuated in the range of 30-50%. We achieved a test accuracy of 40% with the trained model. The model was trained only on resampled balanced data with just 50,000 training samples, which is why we are getting lower accuracies; we found that the traditional algorithms performed better.

TensorBoard dashboard for training visualization:

Data Leakage Control (HCDR):

We performed feature engineering on the secondary datasets and merged them with application train and application test. For feature selection from application train, we selected the top 50 numeric features based on their correlation with the target variable. All of these numeric features, as well as the engineered features, were part of the final dataset. We used pipelines to avoid data leakage during preprocessing of numeric and categorical features. All models were built with the data pipeline + estimator as a single pipeline for cross-validation purposes.

Modelling Pipelines:

Families of Input Features: the inputs used for modelling primarily consist of the following:

Engineered Features:

Total count of engineered features is 24. This is the set of input features derived from data files other than application_train. The top 5 most correlated features are:

  1. cc_LIMIT_USE: total used credit as a percentage of total limit
  2. cc_DRAWING_LIMIT_RATIO: Ratio of total drawings of the month with credit limit amount
  3. ins_INSTALMENT_PAYMENT_RATIO: ratio of actual installment payment and prescribed installment amount
  4. ins_LATE_PAYMENT_RATIO : Installment payment ratio of late payments
  5. bur_UPDATE_DIFF: Difference between the credit end date and last update date

Numeric Features:

We have considered top 50 correlated numeric features. These are the set of numeric features from application_train as well as aggregated features from secondary datasets. Top 5 highly correlated features are:

  1. EXT_SOURCE_3
  2. EXT_SOURCE_2
  3. cc_CNT_DRAWINGS_ATM_CURRENT_sum
  4. cc_CNT_DRAWINGS_ATM_CURRENT_mean
  5. cc_CNT_DRAWINGS_ATM_CURRENT

Categorical Features:

We considered the categorical features from the application train dataset, the total count of categorical features is 16.

Hyperparameter Settings:

Logistic Regression: Cs_list = [10, 100]; regulars = ['l1', 'l2']; solvers = ['liblinear']; params_grid = {'linear__C': Cs_list, 'linear__penalty': regulars, 'linear__solver': solvers}

XGBoost: params = {"xgb__learning_rate": [0.05, 0.10], "xgb__max_depth": [3, 5], "xgb__min_child_weight": [1, 3, 5]}

Random Forest: param_grid = {'rmf__bootstrap': [True], 'rmf__max_depth': [10, 20], 'rmf__max_features': [2, 3], 'rmf__n_estimators': [100, 200]}

MLP: learning rate = 0.01, batch size = 400, optimizer = SGD, loss function = cross entropy (BCEWithLogitsLoss)

Loss Functions:

L1 Regularization

$\sum_{i=0}^{N}(y_i-\sum_{j=0}^{M}x_{ij}W_j)^2 + \lambda\sum_{j=0}^{M}|W_j|$

L2 Regularization $\sum_{i=0}^{N}(y_i-\sum_{j=0}^{M}x_{ij}W_j)^2 + \lambda\sum_{j=0}^{M}W_j^2$

Experiments Conducted:

  1. Baseline Experiment: The baseline model is a logistic regression model with no hyperparameter tuning, used to establish a baseline. It achieved an accuracy of 91.93%.

exp_name        Train Acc  Test Acc  Train AUC  Test AUC
Baseline Model  0.9193     0.9194    0.7418     0.7432

  1. Logistic Regression model with GridSearchCV: We created a modelling pipeline of preprocessing + logistic regressor, which was then passed to GridSearchCV with a set of hyperparameters and CV=5. The test score was calculated with the best parameters. Best parameters (CV=5): {'linear__C': 100, 'linear__penalty': 'l2', 'linear__solver': 'liblinear'}. Model scores for the different evaluation criteria:

exp_name                     Train Acc  Test Acc  Train AUC  Test AUC  Train F1  Test F1
Logreg crossvalidation best  0.9192     0.9193    0.7421     0.7431    0.8828    0.8828

The model did not perform particularly well on the unbalanced label (Target=1).

  1. XGBoost model with GridSearchCV: The XGBoost model was created using the same data processing pipeline with XGBClassifier as the estimator. The model was run over a set of hyperparameters, and the test score was then calculated using the best parameters given by GridSearchCV. Hyperparameters: "xgb__learning_rate": [0.05, 0.10], "xgb__max_depth": [3, 5], "xgb__min_child_weight": [1, 3, 5]. Best parameters: {'xgb__learning_rate': 0.1, 'xgb__max_depth': 5, 'xgb__min_child_weight': 3}

exp_name                      Train Acc  Test Acc  Train AUC  Test AUC  Train F1  Test F1
XGBoost crossvalidation best  0.92       0.9199    0.7652     0.7466    0.8833    0.883

This model also performed poorly on the unbalanced label (Target=1), but its results bested the logistic regressor with a test-set F1 score of 0.883.

  1. Random Forest model with GridSearchCV: The random forest model was created using the same data processing pipeline with RandomForestClassifier as the estimator. The model was run over a set of hyperparameters, and the test score was then calculated using the best parameters given by GridSearchCV. Hyperparameters: param_grid = {'rmf__bootstrap': [True], 'rmf__max_depth': [10, 20], 'rmf__max_features': [2, 3], 'rmf__n_estimators': [100, 200]}

Best parameters: {'rmf__bootstrap': True, 'rmf__max_depth': 20, 'rmf__max_features': 3, 'rmf__n_estimators': 100}

exp_name                            Train Acc  Test Acc  Train AUC  Test AUC  Train F1  Test F1
Random Forest crossvalidation best  0.9223     0.9193    0.9065     0.7105    0.8877    0.8806

The random forest model performed well on the training set but failed to show similar performance on the test set, which indicates slight overfitting.

Data sampling for balancing the target variable:

We performed data sampling using sklearn's resample module with a 2:3 ratio of target value '1' to '0'. We re-ran all the models with their best parameters on the new dataset and recalculated the test-set scores on the evaluation criteria.
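A sketch of this balancing step with sklearn.utils.resample on toy data (the TARGET column name and the 2:3 minority-to-majority ratio are as described above; the class counts are illustrative):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data: 90 repaid (0) vs 10 defaulted (1)
df = pd.DataFrame({"TARGET": [0] * 90 + [1] * 10, "x": range(100)})
majority = df[df.TARGET == 0]
minority = df[df.TARGET == 1]

# Upsample the minority class with replacement to a 2:3 ratio
minority_up = resample(minority, replace=True,
                       n_samples=int(len(majority) * 2 / 3),
                       random_state=42)
balanced = pd.concat([majority, minority_up])
```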

Final Model Tuned:

Based on the AUC and F1 scores on the test dataset, our final tuned model was the XGBoost model with the following hyperparameters: {'xgb__learning_rate': 0.1, 'xgb__max_depth': 5, 'xgb__min_child_weight': 3}

Results and Discussion:

The following table summarizes the model performances before and after data balancing:

exp_name                            Train Acc  Test Acc  Train AUC  Test AUC  Train F1  Test F1
Logreg crossvalidation best         0.919      0.919     0.742      0.743     0.883     0.883
XGBoost crossvalidation best        0.920      0.920     0.765      0.747     0.883     0.883
Random Forest crossvalidation best  0.922      0.919     0.907      0.711     0.888     0.881
Logreg balanced best                0.795      0.797     0.743      0.743     0.833     0.834
XGB balanced best                   0.792      0.790     0.762      0.744     0.831     0.829
RF balanced best                    0.835      0.798     0.888      0.728     0.865     0.834

Loss based on models trained on balanced data:

exp_name               Train loss  Test loss
Logreg Log-loss        7.065       7.002
Randomforest Log-loss  5.713       6.982
XGB Log-loss           7.198       7.258

• In this phase, we improved on the Phase 2 submission by improving our EDA and implementing additional feature engineering on all the secondary datasets.
• We also performed hyperparameter tuning on different models.
• We improved the model evaluation criteria, as accuracy was giving a false impression of the goodness of fit; we used F1 score and AUC as our primary model evaluators.
• We also implemented the PyTorch deep learning model with 2 hidden layers.
• We also made Kaggle submissions for the Random Forest, XGBoost and neural network models.

Public Kaggle scores:
XGBoost        0.65
Random Forest  0.64
MLP            0.50

Conclusion:

The aim of the project is to identify individuals who are capable of repaying their loans. Our machine learning model predicts whether an individual should be given a loan on the basis of the applicant's previous applications, credit bureau history, installment payments, and other primary features such as sources of income, number of family members, dependents, etc. All the machine learning models trained with imbalanced data performed poorly for target value 1. The models were retrained using resampled data, and predictions for target value 1 improved significantly, as evident from the confusion matrices. The deep learning model did not perform on par with the traditional machine learning models because we trained it on a smaller (resampled) subset of the data. Our best performing model was the XGBoost model with a test AUC score of 74.7% and F1 score of 88.3%. The worst performing model was the MLP. The best performing model had a Kaggle score of 65%.

References

https://medium.com/analytics-vidhya/home-credit-default-risk-part-1-business-understanding-data-cleaning-and-eda-1203913e979c

https://towardsdatascience.com/confusion-matrix-for-your-multi-class-machine-learning-model-ff9aa3bf7826

https://www.kaggle.com/c/home-credit-default-risk/discussion/57750

https://www.kaggle.com/c/home-credit-default-risk/discussion/57912